Introduction

This project is part of the Udacity’s Data Analyst Nanodegree Program. I conducted an Exploratory Data Analysis (EDA) on one of the curated data sets, provided by Udacity, from Prosper, which is America’s first marketplace lending platform, with over $12 billion in funded loans as of June 2021.

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1 with a last updated date of 03/11/2014. Here’s a link to the variable definitions for this dataset.

The main purpose of this project is to summarize the characteristics of variables that can affect the loan status and to get some ideas about the relationships among multiple variables using summary statistics and data visualizations.

Initial Dataset Exploration

First, the following are the object names of the dataset.

##  [1] "ListingKey"                          "ListingNumber"                      
##  [3] "ListingCreationDate"                 "CreditGrade"                        
##  [5] "Term"                                "LoanStatus"                         
##  [7] "ClosedDate"                          "BorrowerAPR"                        
##  [9] "BorrowerRate"                        "LenderYield"                        
## [11] "EstimatedEffectiveYield"             "EstimatedLoss"                      
## [13] "EstimatedReturn"                     "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."               "ProsperScore"                       
## [17] "ListingCategory..numeric."           "BorrowerState"                      
## [19] "Occupation"                          "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"            "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                    "GroupKey"                           
## [25] "DateCreditPulled"                    "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"               "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                  "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"          "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"         "InquiriesLast6Months"               
## [35] "TotalInquiries"                      "CurrentDelinquencies"               
## [37] "AmountDelinquent"                    "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"            "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"              "BankcardUtilization"                
## [43] "AvailableBankcardCredit"             "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage."  "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                   "IncomeRange"                        
## [49] "IncomeVerifiable"                    "StatedMonthlyIncome"                
## [51] "LoanKey"                             "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"          "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"            "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"         "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"       "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                          "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                 "LoanOriginationQuarter"             
## [67] "MemberKey"                           "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                 "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                  "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                   "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                 "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                       "Recommendations"                    
## [79] "InvestmentFromFriendsCount"          "InvestmentFromFriendsAmount"        
## [81] "Investors"

Prosper Loan Data contains 113,937 observations and 81 variables.

Next, the following is the initial dataframe’s structure.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : chr  "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : chr  "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
##  $ CreditGrade                        : chr  "C" "" "HR" "" ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ClosedDate                         : chr  "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : chr  "" "A" "" "A" ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation                         : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus                   : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : chr  "True" "False" "False" "True" ...
##  $ CurrentlyInGroup                   : chr  "True" "False" "True" "False" ...
##  $ GroupKey                           : chr  "" "" "783C3371218786870A73D20" "" ...
##  $ DateCreditPulled                   : chr  "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : chr  "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable                   : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : chr  "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter             : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ MemberKey                          : chr  "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Next, I will use a subset of this dataset for performing an analysis. The main purpose of this analysis is to explore the various factors that affect borrowers’ rates. The following are the variables I would like to use.

##  [1] "LoanNumber"                "LoanOriginalAmount"       
##  [3] "LoanOriginationDate"       "LoanOriginationQuarter"   
##  [5] "Term"                      "LoanStatus"               
##  [7] "BorrowerRate"              "LenderYield"              
##  [9] "ProsperRating..Alpha."     "ProsperScore"             
## [11] "ListingCategory..numeric." "EmploymentStatus"         
## [13] "IsBorrowerHomeowner"       "CreditScoreRangeLower"    
## [15] "CreditScoreRangeUpper"     "InquiriesLast6Months"     
## [17] "PublicRecordsLast10Years"  "IncomeRange"              
## [19] "IncomeVerifiable"

Based on the variables in this dataset, I will transform Term and ListingCategory columns from ‘numeric’ to ‘factor’. Then, I will transform the ListingCreationDate column from ‘factor’ to ‘Date’. The following is the dataset’s structure transformed.

## 'data.frame':    113937 obs. of  19 variables:
##  $ LoanNumber              : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate     : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter  : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ Term                    : Factor w/ 3 levels "1 year","3 years",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus              : chr  "Completed" "Current" "Completed" "Current" ...
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield             : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperRating           : Factor w/ 8 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
##  $ ProsperScore            : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory         : Factor w/ 21 levels "NA","Debt Consolidation",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ EmploymentStatus        : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ IsBorrowerHomeowner     : chr  "True" "False" "False" "True" ...
##  $ CreditScoreRangeLower   : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper   : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ InquiriesLast6Months    : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ PublicRecordsLast10Years: int  0 1 0 0 0 0 0 1 0 0 ...
##  $ IncomeRange             : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable        : chr  "True" "True" "True" "True" ...

Univariate Plots Section

In this section, I will perform some preliminary exploration of the transformed dataset. I will run some summaries of the data and create univariate plots to understand the structure of the individual variables in this dataset.

Categorical Variables

First, let’s take a look on various categorical variables starting with Term.

## .
##     1 year    3 years    5 years 
## 0.01416572 0.77040821 0.21542607

Above 90% of the loans in this dataset are with term more than 1 year. Most of them is of 3 years.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2005    2008    2012    2011    2013    2014

The chart took a plunge in 2009 and exploded the year after peaking in 2013.

## .
##              Cancelled             Chargedoff              Completed 
##           0.0000438839           0.1052511476           0.3341671274 
##                Current              Defaulted FinalPaymentInProgress 
##           0.4965551138           0.0440418828           0.0017992399 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##           0.0001404285           0.0070740848           0.0023258467 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##           0.0031859712           0.0027471322           0.0026681412

There are four major loan status: “Defaulted”, “Current”, “Completed”, and “Chargedoff.”

## .
##         AA          A          B          C          D          E         HR 
## 0.06330949 0.17148480 0.18362344 0.21619742 0.16822033 0.11543493 0.08172958 
##         NA 
## 0.00000000

The above bar chart shows Prosper’s rating distribution. The unrated loans are removed in this plot.

## .
##                    Employed     Full-time Not available  Not employed 
##   0.019791639   0.590870393   0.231312041   0.046929443   0.007328611 
##         Other     Part-time       Retired Self-employed 
##   0.033404425   0.009549137   0.006977540   0.053836769

The majority of Prosper users are either employed or have a full-time job.

About half of the users are homeowners.

## .
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##    0.005450380    0.063842299    0.152163037    0.282542107    0.272519024 
## $75,000-99,999  Not displayed   Not employed 
##    0.148468013    0.067941055    0.007074085

About 70% of borrowers’ income are in the range of $25,000 - $100,000

Most of borrowers have their income verified.

Numerical Variables

Next, let’s take a look on various numerical variables starting with BorrowerRate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

The BorrowerRate has a right-skewed distribution, with a mean of 19.28%, and median of 18.40%. Similarly, the LenderYield also has a right-skewed distribution, with a mean of 18.27%, and median of 17.30%.

The ranges of credit scores are small compared to the distributions. I will use lower credit score for further analyses.

##  [1]   3   0   1   7   2   5   4  11   6   9  44  NA   8  12  19  42  27  10  15
## [20]  13  21  17  38  16  14  18  22  25  28  20  23  26  29  24  40  41  30  33
## [39]  31  32  34  35  36  63  97  37  46  52 105  53  50

The distribution of InquiriesLast6Months is right-skewed and long-tailed. Outliers like 105 and 53 exist.

##  [1]  0  1  2  5 NA  3  4  7  6 11  8 12  9 10 15 21 13 25 38 14 16 30 20 34 22
## [26] 17

The distribution of PublicRecordsLast10Years is right-skewed and has some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

LoanOriginalAmount also displays a long-tailed distribution, with some numbers especially higher. The reason why these numbers are more frequent might be because they are the exact numbers (eg. 10,000, 15,000, 20,000) that people tend to choose with a loan original amount minimum of $1,000 and maximum of $35,000

Univariate Analysis

What is the structure of your dataset?

This is a very large dataset with 113937 records and 81 variables on each loan and covers the period from 2005 through 2014. For simplicity with the analysis, I use a subset of the original dataset containing the same number of records, but only with 15 variables that I want to explore. The 15 variables are as follows:

Categorical:

* Term
* LoanStatus
* ProsperRating
* ListingCategory
* EmploymentStatus
* IsBorrowerHomeowner
* IncomeRange
* IncomeVerifiable

Numerical:

* BorrowerRate
* LenderYield
* CreditScoreRangeLower
* CreditScoreRangeUpper
* InquiriesLast6Months
* PublicRecordsLast10Years
* LoanOriginalAmount

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are BorrowerRate and ProsperRating, where ProsperRating might be the overall assessment and main quality indicators of borrowers performed by Prosper. Also, I’ve seen evidence that LenderYield is the most important factor for investors.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features om the dataset that I think will help support my investigation into features of interest are variables regarding borrower’s status (i.e.EmploymentStatus, IsBorrowerHomeowner, etc.) as well as those that reveal borrower’s past payment history (i.e. PublicRecords, Inquiries, etc.). Also, I did not find anything unusual about IncomeRange and IncomeVerifiable, variables that I was expecting more from. On the other hand, maybe this variable have more to tell when related to others.

Did you create any new variables from existing variables in the dataset?

I factorized a few variables because those variables are either intrinsically categorical or they only have several possible values.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Of the features you investigated, I observed some unusual distributions in the LoanOriginalAmount histogram. The frequencies are higher around 4,000, 5,000, 10,000, 15,000 and so on. I believe these amounts are more common for most borrowers to choose from.

I did not perform any operations on the dataset, as the data was tidy, Also, I did not perform other transformation other than changing their classes (e.g. numeric -> factor ) since categorical variables enter into statistical models differently than continuous variables. Thus, storing data as factors ensures that the modeling functions will treat such data accordingly.


Bivariate Plots Section

The correlation matrix shows that there are two strong positive relationships among the features. The trivial features are CreditScoreRangeUpper and CreditScoreRangeLower. Next are the BorrowerRate and LenderYield.

## [1] 0.9992113

The BorrowerRate and the LenderYield have a near perfect linear relationship. This is a result from Prosper charging a fixed rate on loans. Therefore, the BorrowerRate and the LenderYield should have a strong positive relationship.

This boxplot shows that ProsperRating does affect the BorrowerRate and its relationship.

Now, I want to inspect other factors that also have impact on the BorrowerRate and the ProsperRating. Let’s look at the BorrowerRate next.

## $`1 year`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.0929  0.1434  0.1501  0.2064  0.2669 
## 
## $`3 years`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1274  0.1815  0.1935  0.2599  0.4975 
## 
## $`5 years`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0669  0.1490  0.1870  0.1930  0.2319  0.3304

The average rate of the 3-year term is larger than the 1-year term. However, these terms are very close to the 5-year term average. Also, loans with a 5-year term average have a smaller range than the 3-year term group.

## [1] -0.3289599

This graph shows that there is a fairly weak negative relationship between BorrowerRate and OriginalLoanAmount.

It seems that some categories have lower rates and others have higher rates.

Borrowers that are income verifiable or homeowners seem to have lower rates, which makes sense.

Finally, let’s take a look at how ProsperRating affect the aforementioned features.

##    AA     A     B     C     D     E    HR    NA  NA's 
##  5372 14551 15581 18345 14274  9795  6935     0 29084

The majority of loans in this dataset are non-classified (i.e.’NA’), but, among those that are classified, loans with a ‘C’ rating is the highest.

A higher credit score by the consumer rating agency also tends to have a higher ProsperRating.

For top ProsperRating groups like “AA” and “A”, Homeowners are the majority.

This chart shows evidence that it is difficult to get a good rating if a borrower’s income not verifiable

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

First, the LenderYield and BorrowerRate are strongly related because the fixed rate charged by Prosper. Second, the BorrowerRate and ProsperRating do vary with borrower status. In addition, a good relationships I found was through the proprietary Prosper Rating system.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between ListCategory and BorrowerRate is interesting. In average, for this relationship, cosmetic procedure’s rates are relatively higher, whereas boat rates are lower.

What was the strongest relationship you found?

The strongest relationship I found is the one between the LenderYield and BorrowerRate, with a correlation coefficient very close to 1. In addition, the relationship between ProsperRating / LenderYield and ProsperRating / BorrowerRate have an inverse relationship. The higher the rating, the lower the LenderYield and BorrowerRate.


Multivariate Plots Section

Now let’s put everything together based on what I found in the bivariate plots section by creating a few multivariate plots to investigate more complex interactions between variables.

Using jitter plot, we can visualize the real distribution of BorrrowerRate relative to ProsperRating. In addition, I use different colors to indicate whether the individual borrower is a homeowner. This plot shows that most of the “AA” rated borrowers are also homeowners.

In this plot, I found evidence that each rating has its own colored area that is easily recognized, which indicates that ProsperRating is a good predictor for the BorrowerRate. Also, highly rated loans (i.e. “AA”, “A”, and “B”) can have loan original amount over $30,000. On the other hand, most lowly rated loans (i.e. “E” and “HR”) are below $10,000.

## # A tibble: 6 x 3
## # Groups:   ProsperRating [1]
##   ProsperRating ListingCategory    BorrowerRate
##   <fct>         <fct>                     <dbl>
## 1 AA            NA                       0.0655
## 2 AA            Debt Consolidation       0.0800
## 3 AA            Home Improvement         0.0789
## 4 AA            Business                 0.0823
## 5 AA            Student Use              0.0821
## 6 AA            Auto                     0.0759

This plot shows the relative value of the average BorrowerRates of different ListCategory.

Compared to the previous plot, the boxplots give us more information about the outliers and range of the distribution. However, it’s more difficult to compare mean values between different Listingcategory.

Based on this plot, ‘Boat’, ‘Green loans’, and ‘RV’ categories have less data points. As a result, these catefories have less outliers as seen in the previous plots.

The heat map shows that highly rated “Not employed” borrowers have to pay slight higher rates. The jitter plots shows that even “Not employed”, “Retired”, or “Part-time” borrowers can get loan with rates lower than 0.2.

## 
## 
##           mean      
## --------  ----------
## 1 year    0.1500807 
## 3 years   0.1934855 
## 5 years   0.1929907
## 
## 
##           standard deviation 
## --------  -------------------
## 1 year    0.0678582          
## 3 years   0.0792523          
## 5 years   0.0556659

This chart shows a closer look at BorrowerRate and ProsperRating in which most loans have either a 3-year or 5-year term with higher borrower rates. On the other hand, the main difference between 3-year and 5-year term loans is deviation.

There’s evidence that Prosper optimized their rating model throughout the year, as we see the borrower and the variation between borrower rate is not that significant anymore. As a result, we tend to have smaller standard deviation year-over-year. Something to note is the amount of borrowing that suddenly decreased in 2013.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

A feature that strengthened each other in terms of looking at the features of interest is Term, as this is a good indicator whether we have a better BorrowerRate or not. In addition, we see how variables like BorrowerRate and ProsperRating come together and how it affect each other.

Were there any interesting or surprising interactions between features?

The criteria for being an AA borrowers seems to be tighten year-over-year. Also, there seems to be a fixed borrower rate for both the HR and AA criteria.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No, I did not create any models with this dataset.


Final Plots and Summary

At this point, I’ve done some exploration and have built up an understanding of the structure of and relationships between the variables in this Prosper Loan dataset. Next, I will select three plots from all of my previous exploration to present here as a summary of some of my most interesting findings.

Plot One

Description One

The Prosper rating is an important factor for the borrower rate. In other words, a high Prosper rating is linked to lower borrower rates and vice versa.

Plot Two

Description Two

Based on this graph, highly rated loans (i.e. “AA”, “A”, and “B”) can have original amount over $30,000, and most lowly rated loans (i.e. “E” and “HR”) are under 10,000. As a results, Prosper rating is an great indicator of the borrower rate. As you can see, the different colored areas can be easily recognized.

Plot Three

Description Three

This graph shows eveidence that a 3-year term loan will have higher borrower rates when compared to a 1-year term loan in each Prosper rating. In addition, loans with a 5-year term are more concentrated than a 3-year term loan and has a smaller total range as well.


Reflection

Regarding my initial observations on the Prosper loan dataset containing up to 81 variables, I was considering changing datasets because I believed the high variable count would make it hard to explore and analyze. However, after examining the dataset’s variable dictionary that explained the variables in the data set, I was able to determine which variables might be interesting to explore. Furthermore, I researched and explored each variable primarily focusing on features related to the borrower rates and statuses. Next, before diving into my exploratory data analysis, I took a subset of the loan data, creating a smaller dataset in which reduced the variable count to 15.

In my updated dataset, there were only 7 numerical variables. Initially, my primary focus was to start analyzing the borrower rate and the various borrower statuses. However, I decided to go through each variable one by one, trying to understand any relationships between the variables in the dataset since I’m currently working in the financial industry. As a result, most of the data visualizations I created included at least one categorical variable.

I did spend a lot of time researching how to create visualization with one, two, or more categorical variables. One of my main struggles, which is not uncommon, was the material that I’ve learned up to this point were outdated, so I ended up doing additional research on the latest and up-to-date libraries and functions needed to complete this project. Based on my research and what I’ve learned, my solution was to utilize boxplots, bar charts, jitter plots, heat maps, and a variety of libraries/functions to create the visualizations needed for this project.

In conclusion of my project, my analysis shows that Prosper rating and term are two major indicators of the borrower rate. In addition, borrower statuses also have impacts on borrower rate, but their effects are either small or are also factors of Prosper rating. Overall, I think this project would be a good reference for anyone who wants to borrower money from a P2P lending platform. For future work and to expand the project, I can try to combine various variables and make a predictive model using logistic regression for a go/no-go investment decision or any classification algorithm would make sense. Overall, I enjoyed this project. I will say that my exploratory data analysis did give me better insight into the lending business.